30-Day Readmission Predicition with ICU Monitoring Data

1 Overview and Background

With the development of modern large scale computing technologies, machine learing techniques have been increasingly applied to large scale health care datasets, such as clinical data and biological data. Developing and optimizing health-related systems are no doubt promising as our human beings' health conditions always come first. Among numerous research topics, predicting hospital 30-Day readmission is of our interests. If we are able to accurately predict readmission, we can identify patients at high risk and prevent early discharging them. As a result, patients will be given more precision care and hospitals can better manage resources, reduce waste room and save costs. Existing works have already explored this topic [1-4]. Before designing the model, extracting most useful features is of great importance. By far, the features that have been taken into consideration includes, demographic characteristics: age, gender, race etc; lab results, chart events etc. monitored in ICU [1]; electronic health record data: length of stay, number of admissions, admission type etc. [2,3]. Machine learning techniques that have been used include random forest [1], neural networks [4] etc. Our project goal is to extract relevant features from MIMIC database and design classification models to optimize performance.

References:

[1] Yaron Blinder. Predicting 30-day ICU readmissions from the MIMIC-III database. https://github.com/YaronBlinder/MIMIC-III_readmission. 2017.

[2] Oanh Kieu Nguyen et al. Predicting all-cause readmissions using electronic health record data from the entire hospitalization: model development and comparison". In: Journal of hospital medicine 11.7 (2016), pp. 473-480.

[3] Frida Kareliusson, Lina De Geer, and Anna Oscarsson Tibblin. Risk prediction of ICU readmission in a mixed surgical and medical population". In: Journal of intensive care 3.1 (2015), p. 30.

[4] Ricardo Bento Afonso. Feature Extraction and Selection for Prediction of ICU Patient's Readmission Using Artificial Neural Networks". In: (2013).

2 Data Processing

  • Generate lables: For each patient, we compute the intervals between his two sequent admissions. If the interval is shorter than 30 days, the readmission label for the encounter is 1; otherwise the readmission label is 0.
  • Create feature matrix: This set of features mainly contains ICU monitoring data. Based on literature study, we find potentially predictive features, such as the glucose, heart rate and body temperature. These features are recorded in the CHARTEVENTS table and the LABEVENTS table and their values can be extracted via ITEMID. We pre-filter the values, for example, the value of heart rate is limited within 300. Then we calculate the minimum, the maximum and the mean value of each feature. We fill the missing values with column mean. We also add some demographic features, such as age and genders.
In [1]:
import sqlite3 
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import confusion_matrix

import time
import math

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

import warnings
warnings.filterwarnings('ignore')
In [2]:
plt.rcParams['figure.dpi'] = 300
In [3]:
conn = sqlite3.connect('C:/Users/culiubaicaiya/Downloads/bme590 course note/data/MIMIC.db')

2.1 Generate Labels

In [4]:
readmit_df = pd.read_sql('''SELECT * FROM
                           (SELECT SUBJECT_ID, HADM_ID, ADMITTIME, DISCHTIME, 
                                   RANK() OVER(PARTITION BY SUBJECT_ID ORDER BY ADMITTIME) AS ADMITTIME_RANK 
                            FROM ADMISSIONS) a
                            LEFT JOIN 
                           (SELECT SUBJECT_ID, MAX(ADMITTIME_RANK) AS MAX_ADMITTIME_RANK FROM
                           (SELECT SUBJECT_ID, HADM_ID, ADMITTIME, DISCHTIME, 
                                   RANK() OVER(PARTITION BY SUBJECT_ID ORDER BY ADMITTIME) AS ADMITTIME_RANK 
                            FROM ADMISSIONS)
                            GROUP BY SUBJECT_ID) b
                            ON a.SUBJECT_ID = b.SUBJECT_ID''', conn)

readmit_df[['ADMITTIME','DISCHTIME']] = readmit_df[['ADMITTIME','DISCHTIME']].apply(lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S'))

next_admittime = readmit_df['ADMITTIME'].values[1:]
next_admittime = np.append(next_admittime, next_admittime[0])
readmit_df['NEXT_ADMITTIME'] = next_admittime

readmit_df['30_READMIT'] = -1

readmit_df.loc[readmit_df['ADMITTIME_RANK'] == readmit_df['MAX_ADMITTIME_RANK'], '30_READMIT'] = 0

readmit_sub_df = readmit_df[readmit_df['30_READMIT']==-1]
interval = (readmit_sub_df['NEXT_ADMITTIME'] - readmit_sub_df['DISCHTIME']).dt.days.values
readmit_df.loc[readmit_df['30_READMIT']==-1, 'INTERVAL'] = interval 

readmit_df.loc[readmit_df['INTERVAL']<=30, '30_READMIT'] = 1
readmit_df.loc[readmit_df['INTERVAL']>30, '30_READMIT'] = 0
In [5]:
readmit_df.head()
Out[5]:
SUBJECT_ID HADM_ID ADMITTIME DISCHTIME ADMITTIME_RANK SUBJECT_ID MAX_ADMITTIME_RANK NEXT_ADMITTIME 30_READMIT INTERVAL
0 2 163353 2138-07-17 19:04:00 2138-07-21 15:48:00 1 2 1 2101-10-20 19:08:00 0 NaN
1 3 145834 2101-10-20 19:08:00 2101-10-31 13:58:00 1 3 1 2191-03-16 00:28:00 0 NaN
2 4 185777 2191-03-16 00:28:00 2191-03-23 18:41:00 1 4 1 2103-02-02 04:31:00 0 NaN
3 5 178980 2103-02-02 04:31:00 2103-02-04 12:15:00 1 5 1 2175-05-30 07:15:00 0 NaN
4 6 107064 2175-05-30 07:15:00 2175-06-15 16:00:00 1 6 1 2121-05-23 15:05:00 0 NaN
In [6]:
readmit_label_df = readmit_df[['HADM_ID', '30_READMIT']]
In [7]:
readmit_label_df.head()
Out[7]:
HADM_ID 30_READMIT
0 163353 0
1 145834 0
2 185777 0
3 178980 0
4 107064 0
In [8]:
readmit_label_df.shape
Out[8]:
(58976, 2)

2.2 Create Feature Matrix

2.2.1 Numerical Features

In [9]:
le_df = pd.read_sql(''' select le.subject_id, le.hadm_id
                        , min(case when le.itemid=51006 then le.valuenum else null end) as urea_N_min
                        , max(case when le.itemid=51006 then le.valuenum else null end) as urea_N_max
                        , avg(case when le.itemid=51006 then le.valuenum else null end) as urea_N_mean
                        , min(case when le.itemid=51265 then le.valuenum else null end) as platelets_min
                        , max(case when le.itemid=51265 then le.valuenum else null end) as platelets_max
                        , avg(case when le.itemid=51265 then le.valuenum else null end) as platelets_mean
                        , max(case when le.itemid=50960 then le.valuenum else null end) as magnesium_max
                        , min(case when le.itemid=50862 then le.valuenum else null end) as albumin_min
                        , min(case when le.itemid=50893 then le.valuenum else null end) as calcium_min
                        from labevents le
                        where hadm_id is not null
                        group by 1,2 order by 1,2
''', conn)
In [10]:
le_df.head()
Out[10]:
SUBJECT_ID HADM_ID urea_N_min urea_N_max urea_N_mean platelets_min platelets_max platelets_mean magnesium_max albumin_min calcium_min
0 2 163353.0 NaN NaN NaN 5.0 302.0 153.500000 NaN NaN NaN
1 3 145834.0 15.0 53.0 28.133333 121.0 359.0 210.866667 3.5 1.8 5.8
2 4 185777.0 9.0 20.0 16.125000 201.0 388.0 286.428571 2.2 2.8 7.7
3 5 178980.0 NaN NaN NaN 309.0 309.0 309.000000 NaN NaN NaN
4 6 107064.0 16.0 91.0 53.941176 152.0 331.0 247.947368 3.3 2.7 6.8
In [11]:
chev_df = pd.read_sql('''select hadm_id
                        , min(case when itemid in (615,618,220210,224690) and valuenum > 0 and valuenum < 70 then valuenum else null end) as RespRate_Min
                        , max(case when itemid in (615,618,220210,224690) and valuenum > 0 and valuenum < 70 then valuenum else null end) as RespRate_Max
                        , avg(case when itemid in (615,618,220210,224690) and valuenum > 0 and valuenum < 70 then valuenum else null end) as RespRate_Mean
                        , min(case when itemid in (807,811,1529,3745,3744,225664,220621,226537) and valuenum > 0 then valuenum else null end) as Glucose_Min
                        , max(case when itemid in (807,811,1529,3745,3744,225664,220621,226537) and valuenum > 0 then valuenum else null end) as Glucose_Max
                        , avg(case when itemid in (807,811,1529,3745,3744,225664,220621,226537) and valuenum > 0 then valuenum else null end) as Glucose_Mean
                        , min(case when itemid in (211,220045) and valuenum > 0 and valuenum < 300 then valuenum else null end) as HR_min
                        , max(case when itemid in (211,220045) and valuenum > 0 and valuenum < 300 then valuenum else null end) as HR_max
                        , round(cast(avg(case when itemid in (211,220045) and valuenum > 0 and valuenum < 300 then valuenum else null end) as numeric), 2) as HR_mean
                        , min(case when itemid in (51,442,455,6701,220179,220050) and valuenum > 0 and valuenum < 400 then valuenum else null end) as SysBP_min
                        , max(case when itemid in (51,442,455,6701,220179,220050) and valuenum > 0 and valuenum < 400 then valuenum else null end) as SysBP_max
                        , round(cast(avg(case when itemid in (51,442,455,6701,220179,220050) and valuenum > 0 and valuenum < 400 then valuenum else null end) as numeric), 2) as SysBP_mean
                        , min(case when itemid in (8368,8440,8441,8555,220180,220051) and valuenum > 0 and valuenum < 300 then valuenum else null end) as DiasBP_min
                        , max(case when itemid in (8368,8440,8441,8555,220180,220051) and valuenum > 0 and valuenum < 300 then valuenum else null end) as DiasBP_max
                        , round(cast(avg(case when itemid in (8368,8440,8441,8555,220180,220051) and valuenum > 0 and valuenum < 300 then valuenum else null end) as numeric), 2) as DiasBP_mean
                        , min(case when itemid in (223761,678) and valuenum > 70 and valuenum < 120 then (valuenum-32)/1.8
                                   when itemid in (223762,676)  and valuenum > 10 and valuenum < 50 then valuenum else null end) as temp_min
                        , max(case when itemid in (223761,678) and valuenum > 70 and valuenum < 120 then (valuenum-32)/1.8
                                   when itemid in (223762,676)  and valuenum > 10 and valuenum < 50 then valuenum else null end) as temp_max
                        , round(cast(avg(case when itemid in (223761,678) and valuenum > 70 and valuenum < 120 then (valuenum-32)/1.8
                                   when itemid in (223762,676)  and valuenum > 10 and valuenum < 50 then valuenum else null end) as numeric), 2) as temp_mean
                        from chartevents
                        where itemid in
                        (
                          615,618,220210,224690, --- RespRate
                          807,811,1529,3745,3744,225664,220621,226537, --- Glucose
                          211,220045,---HR
                          51,442,455,6701,220179,220050,---SysBP
                          8368,8440,8441,8555,220180,220051,--DiasBP
                          223761,678,223762,676--Temp
                        )
                        and hadm_id is not null
                        group by 1
''', conn)
In [12]:
chev_df.head()
Out[12]:
HADM_ID RespRate_Min RespRate_Max RespRate_Mean Glucose_Min Glucose_Max Glucose_Mean HR_min HR_max HR_mean SysBP_min SysBP_max SysBP_mean DiasBP_min DiasBP_max DiasBP_mean temp_min temp_max temp_mean
0 100001 14 22 17.357143 100 93 179.382353 101 131 115.99 110 209 174.19 100 111 104.63 None None NaN
1 100003 11 23 15.818182 113 68 93.000000 100 104 102.00 100 149 121.66 None None NaN None None NaN
2 100006 11 29 18.762295 106 220 161.227273 100 136 112.53 113 167 140.51 100 113 105.00 None None NaN
3 100007 11 27 19.312500 102 77 127.333333 100 111 106.17 108 164 140.75 114 114 114.00 None None NaN
4 100009 10 37 24.145161 100 98 153.777778 None None NaN 100 138 115.24 None None NaN None None NaN
In [13]:
chev_df = chev_df.astype('float64')
In [14]:
opev_df = pd.read_sql('''select hadm_id
    , min(value) as urine_min
    , max(value) as urine_max
    , round(cast(avg(value) as numeric)) as urine_mean
    from outputevents
    where itemid in (40055,226559)
    and hadm_id is not null
    group by 1
''', conn)
In [15]:
opev_df.head()
Out[15]:
HADM_ID urine_min urine_max urine_mean
0 100003.0 50.0 360.0 152.0
1 100006.0 25.0 600.0 153.0
2 100007.0 0.0 300.0 74.0
3 100009.0 30.0 500.0 160.0
4 100010.0 10.0 150.0 64.0
In [16]:
age_df = pd.read_sql(''' SELECT a.HADM_ID, ADMITTIME, b. DOB
                             FROM ADMISSIONS a
                             INNER JOIN PATIENTS b on a.SUBJECT_ID = b.SUBJECT_ID
''', conn)

age_df[['ADMITTIME','DOB']] =age_df[['ADMITTIME','DOB']].apply(lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S'))

idx_list = []
for idx in range(age_df.shape[0]):
    age = (age_df['ADMITTIME'][idx] - age_df['DOB'][idx]).days //365
    if age != 0 and age<= 120:
        idx_list.append(idx)

age_df = age_df.loc[idx_list, :] 

age_df['AGE'] = (age_df['ADMITTIME'] - age_df['DOB']).dt.days.values // 365

age_df = age_df.drop(['ADMITTIME', 'DOB'], axis=1)
In [17]:
age_df.head()
Out[17]:
HADM_ID AGE
0 165315 64
1 152223 71
2 124321 75
3 161859 39
4 129635 58
In [18]:
numfeat_df = pd.merge(le_df, opev_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
numfeat_df = pd.merge(numfeat_df, age_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
numfeat_df = pd.merge(numfeat_df, chev_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
numfeat_df = numfeat_df.drop(['SUBJECT_ID'], axis=1)

# fill missing values with column mean
numfeat_mean = numfeat_df.mean().to_dict()
numfeat_df = numfeat_df.fillna(value=numfeat_mean)
In [19]:
numfeat_df.head()
Out[19]:
HADM_ID urea_N_min urea_N_max urea_N_mean platelets_min platelets_max platelets_mean magnesium_max albumin_min calcium_min ... HR_mean SysBP_min SysBP_max SysBP_mean DiasBP_min DiasBP_max DiasBP_mean temp_min temp_max temp_mean
0 145834.0 15.0 53.0 28.133333 121.0 359.0 210.866667 3.5 1.8 5.8 ... 122.57 100.0 217.0 124.99 103.000000 28.000000 51.670000 36.200001 37.599998 36.950000
1 107064.0 16.0 91.0 53.941176 152.0 331.0 247.947368 3.3 2.7 6.8 ... 102.00 125.0 187.0 153.75 80.612157 78.511883 78.439944 36.047597 37.930744 37.049795
2 150750.0 16.0 33.0 22.714286 221.0 330.0 274.250000 3.0 2.9 7.8 ... 105.50 100.0 240.0 151.47 100.000000 194.000000 119.170000 36.047597 37.930744 37.049795
3 112213.0 28.0 41.0 32.214286 51.0 706.0 250.235294 2.9 2.2 7.9 ... 102.67 100.0 231.0 153.12 100.000000 111.000000 104.200000 36.047597 37.930744 37.049795
4 143045.0 13.0 22.0 16.200000 74.0 273.0 161.800000 2.4 3.9 8.7 ... 109.95 100.0 189.0 133.94 80.612157 78.511883 78.439944 36.299999 38.599998 37.510000

5 rows × 32 columns

2.2.2 Categorical Features

In [20]:
catfeat_df = pd.read_sql(''' SELECT a.HADM_ID, INSURANCE, MARITAL_STATUS, b.GENDER, c.FIRST_CAREUNIT, LAST_CAREUNIT
                           FROM ADMISSIONS a
                           INNER JOIN PATIENTS b on a.SUBJECT_ID = b.SUBJECT_ID
                           INNER JOIN ICUSTAYS c on a.HADM_ID = c.HADM_ID
''', conn)
catfeat_df = pd.get_dummies(catfeat_df)
In [21]:
catfeat_df.head()
Out[21]:
HADM_ID INSURANCE_Government INSURANCE_Medicaid INSURANCE_Medicare INSURANCE_Private INSURANCE_Self Pay MARITAL_STATUS_DIVORCED MARITAL_STATUS_LIFE PARTNER MARITAL_STATUS_MARRIED MARITAL_STATUS_SEPARATED ... FIRST_CAREUNIT_MICU FIRST_CAREUNIT_NICU FIRST_CAREUNIT_SICU FIRST_CAREUNIT_TSICU LAST_CAREUNIT_CCU LAST_CAREUNIT_CSRU LAST_CAREUNIT_MICU LAST_CAREUNIT_NICU LAST_CAREUNIT_SICU LAST_CAREUNIT_TSICU
0 165315 0 0 0 1 0 0 0 1 0 ... 1 0 0 0 0 0 1 0 0 0
1 152223 0 0 1 0 0 0 0 1 0 ... 0 0 0 0 0 1 0 0 0 0
2 124321 0 0 1 0 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 1 0
3 161859 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
4 129635 0 0 0 1 0 0 0 1 0 ... 0 0 0 0 1 0 0 0 0 0

5 rows × 27 columns

2.3 Create Design Matrix

In [22]:
des_mat = pd.merge(readmit_label_df, catfeat_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
des_mat = pd.merge(des_mat, numfeat_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
In [23]:
des_mat.head()
Out[23]:
HADM_ID 30_READMIT INSURANCE_Government INSURANCE_Medicaid INSURANCE_Medicare INSURANCE_Private INSURANCE_Self Pay MARITAL_STATUS_DIVORCED MARITAL_STATUS_LIFE PARTNER MARITAL_STATUS_MARRIED ... HR_mean SysBP_min SysBP_max SysBP_mean DiasBP_min DiasBP_max DiasBP_mean temp_min temp_max temp_mean
0 145834 0 0 0 1 0 0 0 0 1 ... 122.57 100.0 217.0 124.99 103.000000 28.000000 51.670000 36.200001 37.599998 36.950000
1 107064 0 0 0 1 0 0 0 0 1 ... 102.00 125.0 187.0 153.75 80.612157 78.511883 78.439944 36.047597 37.930744 37.049795
2 150750 0 0 1 0 0 0 0 0 0 ... 105.50 100.0 240.0 151.47 100.000000 194.000000 119.170000 36.047597 37.930744 37.049795
3 112213 0 0 0 1 0 0 0 0 1 ... 102.67 100.0 231.0 153.12 100.000000 111.000000 104.200000 36.047597 37.930744 37.049795
4 143045 0 0 1 0 0 0 0 0 0 ... 109.95 100.0 189.0 133.94 80.612157 78.511883 78.439944 36.299999 38.599998 37.510000

5 rows × 59 columns

In [24]:
des_mat.columns
Out[24]:
Index(['HADM_ID', '30_READMIT', 'INSURANCE_Government', 'INSURANCE_Medicaid',
       'INSURANCE_Medicare', 'INSURANCE_Private', 'INSURANCE_Self Pay',
       'MARITAL_STATUS_DIVORCED', 'MARITAL_STATUS_LIFE PARTNER',
       'MARITAL_STATUS_MARRIED', 'MARITAL_STATUS_SEPARATED',
       'MARITAL_STATUS_SINGLE', 'MARITAL_STATUS_UNKNOWN (DEFAULT)',
       'MARITAL_STATUS_WIDOWED', 'GENDER_F', 'GENDER_M', 'FIRST_CAREUNIT_CCU',
       'FIRST_CAREUNIT_CSRU', 'FIRST_CAREUNIT_MICU', 'FIRST_CAREUNIT_NICU',
       'FIRST_CAREUNIT_SICU', 'FIRST_CAREUNIT_TSICU', 'LAST_CAREUNIT_CCU',
       'LAST_CAREUNIT_CSRU', 'LAST_CAREUNIT_MICU', 'LAST_CAREUNIT_NICU',
       'LAST_CAREUNIT_SICU', 'LAST_CAREUNIT_TSICU', 'urea_N_min', 'urea_N_max',
       'urea_N_mean', 'platelets_min', 'platelets_max', 'platelets_mean',
       'magnesium_max', 'albumin_min', 'calcium_min', 'urine_min', 'urine_max',
       'urine_mean', 'AGE', 'RespRate_Min', 'RespRate_Max', 'RespRate_Mean',
       'Glucose_Min', 'Glucose_Max', 'Glucose_Mean', 'HR_min', 'HR_max',
       'HR_mean', 'SysBP_min', 'SysBP_max', 'SysBP_mean', 'DiasBP_min',
       'DiasBP_max', 'DiasBP_mean', 'temp_min', 'temp_max', 'temp_mean'],
      dtype='object')
In [25]:
des_mat.shape
Out[25]:
(42228, 59)
In [26]:
y = des_mat['30_READMIT'].values
In [27]:
len(y)
Out[27]:
42228
In [28]:
len(y[y==1])
Out[28]:
2605
In [29]:
X = des_mat.drop(['HADM_ID', '30_READMIT'], axis=1).values
In [30]:
X.shape
Out[30]:
(42228, 57)
In [31]:
X_catfeat = X[:, :26]
In [32]:
X_catfeat
Out[32]:
array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 1., 0.]])
In [33]:
X_numfeat = X[:, 26:]
In [34]:
X_numfeat
Out[34]:
array([[15.        , 53.        , 28.13333333, ..., 36.20000076,
        37.59999847, 36.95      ],
       [16.        , 91.        , 53.94117647, ..., 36.04759729,
        37.93074379, 37.04979546],
       [16.        , 33.        , 22.71428571, ..., 36.04759729,
        37.93074379, 37.04979546],
       ...,
       [34.        , 56.        , 47.5       , ..., 36.04759729,
        37.93074379, 37.04979546],
       [10.        , 13.        , 11.5       , ..., 36.04759729,
        37.93074379, 37.04979546],
       [13.        , 32.        , 20.        , ..., 36.04759729,
        37.93074379, 37.04979546]])

3 Data Exploration

We visualized the readmission proportion of the two datasets. Apparently, the datasets have highly imbalanced classes. We also visualized the scatter plot of the numerical data. From the scatter plot, the data is hardly separable, which increases the difficulty for model design.

In [35]:
H1_num = len(y[y==1])
H0_num = len(y[y==0])
total_num = len(y)
In [36]:
H1_num
Out[36]:
2605
In [37]:
labels = '30Days Readmission-Y', '30Days Readmission-N'
sizes = [H1_num/total_num*100, H0_num/total_num*100]
colors = ['pink', 'skyblue']
explode = (0.05, 0) 
fig = plt.figure(figsize=(6,6))
plt.pie(sizes, explode=explode, colors=colors, autopct='%1.1f%%', startangle=90)
plt.legend(labels, fontsize='small', loc='center left', bbox_to_anchor=(1, 0, 0.5, 1))
Out[37]:
<matplotlib.legend.Legend at 0x1efd2c75978>
In [38]:
full_index = np.random.permutation(np.arange(des_mat.shape[0]))
sub_index = full_index[:500]
draw_df = des_mat.iloc[sub_index, :]
draw_df = draw_df.drop(['HADM_ID'], axis=1)
draw_df.loc[draw_df['30_READMIT']==0, '30_READMIT'] = 'N'
draw_df.loc[draw_df['30_READMIT']==1, '30_READMIT'] = 'Y'
draw_num_df = pd.concat((draw_df['30_READMIT'], draw_df.iloc[:, 27:33]), axis=1)
In [39]:
num_feat_plot = sns.pairplot(draw_num_df, hue='30_READMIT', plot_kws=dict(s=30)) 

4 Modeling

We first randomly assigned the data instances into different folds to prepare for cross validation. For preprocessing, we employed down sampling and over sampling methods to relieve the class imbalance problem. For the numerical data in ICU dataset, we used z-scoring to normalize the data. Then we tested logistic regression and decision tree. To increase the performance, we applied the ensemble methods, including bagging logitic regression and random forest. Finally, we evaluated the cross validation performance and tuned parameters to optimize the performance.

In [40]:
from IPython.display import Image
PATH = "./pic/"
Image(filename = PATH + "pipeline.jpg", width=800, height=400)
Out[40]:

4.1 Logistic Regression

4.1.1 Over - Sampling

In [41]:
ros = RandomOverSampler(random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=0)
y_true = []
y_pred = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
    
    Xcat_train, Xnum_train = X_train_ros[:, :26], X_train_ros[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    
    # standarization
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.fit_transform(Xnum_test)
    
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    
    lr_clf = LogisticRegression(random_state=0, solver='liblinear')
    lr_clf.fit(X_train_new, y_train_ros)
    y_pred_sub = lr_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)
    
lr_os_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = lr_os_conf.ravel()
lr_os_sen = tp/(tp+fn)
lr_os_spe = tn/(tn+fp)
lr_os_prec = tp/(tp+fp)
lr_os_acc = (tp+tn)/(tp+tn+fp+fn)

print ('Logistic Regression, Over_Sampling')
print()
print('confusion matrix:')
print(lr_os_conf)
print('sensitivity score:{:.4}'.format(lr_os_sen))
print('specificity score:{:.4}'.format(lr_os_spe))
print('precision score:{:.4}'.format(lr_os_prec))
print('accuracy score:{:.4}'.format(lr_os_acc))
Logistic Regression, Over_Sampling

confusion matrix:
[[16630 22993]
 [  866  1739]]
sensitivity score:0.6676
specificity score:0.4197
precision score:0.07031
accuracy score:0.435

4.1.2 Under - Sampling

In [42]:
rus = RandomUnderSampler(random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=0)
y_true = []
y_pred = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
    
    Xcat_train, Xnum_train = X_train_rus[:, :26], X_train_rus[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    
    # standarization
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.fit_transform(Xnum_test)
    
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    
    lr_clf = LogisticRegression(random_state=0, solver='liblinear')
    lr_clf.fit(X_train_new, y_train_rus)
    y_pred_sub = lr_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)

lr_us_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = lr_us_conf.ravel()
lr_us_sen = tp/(tp+fn)
lr_us_spe = tn/(tn+fp)
lr_us_prec = tp/(tp+fp)
lr_us_acc = (tp+tn)/(tp+tn+fp+fn)

print ('Logistic Regression, Under_Sampling')
print()
print('confusion matrix:')
print(lr_us_conf)
print('sensitivity score:{:.4}'.format(lr_us_sen))
print('specificity score:{:.4}'.format(lr_us_spe))
print('precision score:{:.4}'.format(lr_us_prec))
print('accuracy score:{:.4}'.format(lr_us_acc))
Logistic Regression, Under_Sampling

confusion matrix:
[[21510 18113]
 [  977  1628]]
sensitivity score:0.625
specificity score:0.5429
precision score:0.08247
accuracy score:0.5479

4.2 Bagging Logist Regression

4.2.1 Over - Sampling

In [43]:
ros = RandomOverSampler(random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=0)
y_true = []
y_pred = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
    
     
    Xcat_train, Xnum_train = X_train_ros[:, :26], X_train_ros[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    
    # standarization
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.fit_transform(Xnum_test)
    
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    
    base_estimator = LogisticRegression(random_state=0, solver='liblinear')
    max_features = int(math.sqrt(X_train.shape[1]))
    blr_clf = BaggingClassifier(base_estimator=base_estimator, n_estimators=100, max_features=max_features, random_state=0)
    blr_clf.fit(X_train_new, y_train_ros)
    y_pred_sub = blr_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)
    
blr_os_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = blr_os_conf.ravel()
blr_os_sen = tp/(tp+fn)
blr_os_spe = tn/(tn+fp)
blr_os_prec = tp/(tp+fp)
blr_os_acc = (tp+tn)/(tp+tn+fp+fn)

print ('Bagging Logistic Regression, Over_Sampling')
print()
print('confusion matrix:')
print(blr_os_conf)
print('sensitivity score:{:.4}'.format(blr_os_sen))
print('specificity score:{:.4}'.format(blr_os_spe))
print('precision score:{:.4}'.format(blr_os_prec))
print('accuracy score:{:.4}'.format(blr_os_acc))
Bagging Logistic Regression, Over_Sampling

confusion matrix:
[[17258 22365]
 [  819  1786]]
sensitivity score:0.6856
specificity score:0.4356
precision score:0.07395
accuracy score:0.451

4.2.2 Under-Sampling

In [44]:
rus = RandomUnderSampler(random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=0)
y_true = []
y_pred = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
    
    Xcat_train, Xnum_train = X_train_rus[:, :26], X_train_rus[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    
    # standarization
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.fit_transform(Xnum_test)
    
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    
    base_estimator = LogisticRegression(penalty='l1', random_state=0, solver='liblinear')
    max_features = int(math.sqrt(X_train.shape[1]))
    blr_clf = BaggingClassifier(base_estimator=base_estimator, n_estimators=100, max_features=max_features, random_state=0)
    blr_clf.fit(X_train_new, y_train_rus)
    y_pred_sub = blr_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)
    
blr_us_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = blr_us_conf.ravel()
blr_us_sen = tp/(tp+fn)
blr_us_spe = tn/(tn+fp)
blr_us_prec = tp/(tp+fp)
blr_us_acc = (tp+tn)/(tp+tn+fp+fn)

print ('Bagging Logistic Regression, Under_Sampling')
print()
print('confusion matrix:')
print(blr_us_conf)
print('sensitivity score:{:.4}'.format(blr_us_sen))
print('specificity score:{:.4}'.format(blr_us_spe))
print('precision score:{:.4}'.format(blr_us_prec))
print('accuracy score:{:.4}'.format(blr_us_acc))
Bagging Logistic Regression, Under_Sampling

confusion matrix:
[[20917 18706]
 [  980  1625]]
sensitivity score:0.6238
specificity score:0.5279
precision score:0.07993
accuracy score:0.5338

4.3 Decision Tree

In [45]:
skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=0)
y_true = []
y_pred = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    Xcat_train, Xnum_train = X_train[:, :26], X_train[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    
    # standarization
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.fit_transform(Xnum_test)
    
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    
    dt_clf = DecisionTreeClassifier(max_depth=7, random_state=0, class_weight='balanced')
    dt_clf.fit(X_train_new, y_train)
    y_pred_sub = dt_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)

dt_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = dt_conf.ravel()
dt_sen = tp/(tp+fn)
dt_spe = tn/(tn+fp)
dt_prec = tp/(tp+fp)
dt_acc = (tp+tn)/(tp+tn+fp+fn)

print ('Decision Tree')
print()
print('confusion matrix:')
print(dt_conf)
print('sensitivity score:{:.4}'.format(dt_sen))
print('specificity score:{:.4}'.format(dt_spe))
print('precision score:{:.4}'.format(dt_prec))
print('accuracy score:{:.4}'.format(dt_acc))
Decision Tree

confusion matrix:
[[19573 20050]
 [  970  1635]]
sensitivity score:0.6276
specificity score:0.494
precision score:0.0754
accuracy score:0.5022

4.4 Random Forest

In [46]:
skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=0)
y_true = []
y_pred = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    Xcat_train, Xnum_train = X_train[:, :26], X_train[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    
     # standarization
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.fit_transform(Xnum_test)
    
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    
    rf_clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, class_weight='balanced', n_jobs=-1)
    rf_clf.fit(X_train_new, y_train)
    y_pred_sub = rf_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)

rf_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = rf_conf.ravel()
rf_sen = tp/(tp+fn)
rf_spe = tn/(tn+fp)
rf_prec = tp/(tp+fp)
rf_acc = (tp+tn)/(tp+tn+fp+fn)

print ('Random Forest')
print()
print('confusion matrix:')
print(rf_conf)
print('sensitivity score:{:.4}'.format(rf_sen))
print('specificity score:{:.4}'.format(rf_spe))
print('precision score:{:.4}'.format(rf_prec))
print('accuracy score:{:.4}'.format(rf_acc))
Random Forest

confusion matrix:
[[21293 18330]
 [  925  1680]]
sensitivity score:0.6449
specificity score:0.5374
precision score:0.08396
accuracy score:0.544

5 Model Comparision

The classes are highly imbalanced, so we cannot simply count on accuracy to evaluate model performance. We computed four performance scores, including sensitivity, specificity, precision and accuracy. We faced the sensitivity and specificity trade-off when tuning parameters. The goal of the project is to predict patients at high risk of readmission to provide them with special care, so we emphasized the sensitivity score when evaluating model performance.

5.1 Confusion Matrix

In [47]:
def draw_conf_mat(conf_mat, title, ax):
    conf_plot = ax.matshow(conf_mat, cmap=plt.cm.Blues)
    class_names = ['Not', 'Readmit']
    ax.set_xticklabels([''] + class_names,fontdict={'fontsize':'20'})
    ax.set_yticklabels([''] + class_names,fontdict={'fontsize':'20'})
    plt.title(title, fontsize='28')
    plt.xlabel('Prediction', fontsize='20')
    plt.ylabel('Truth', fontsize='20')
    num = conf_mat.ravel()
    plt.text(-.15, 0, str(num[0]), fontdict={'color': 'white', 'size':28})
    plt.text(.85, 0, str(num[1]), fontdict={'color': 'white', 'size':28})
    plt.text(-.15, 1, str(num[2]), fontdict={'size':28})
    plt.text(.85, 1, str(num[3]), fontdict={'size':28})
In [48]:
conf_mat_list = [lr_os_conf, lr_us_conf, blr_os_conf,blr_us_conf,dt_conf, rf_conf]
clf = ['Logistic Regression_OverSampling', 'Logistic Regression_UnderSampling', 
               'Bagging Logistic Regression_OverSampling', 'Bagging Logistic Regression_UnderSampling', 
               'Decision Tree', 'Random Forest']
In [49]:
fig = plt.figure(figsize=(30,20))
for i in range(1, 7):
    ax = fig.add_subplot(int('23{}'.format(i)))
    draw_conf_mat(conf_mat_list[i-1], clf[i-1], ax)
    if i!=1 and i!=4:
        ax.set_ylabel('')

5.2 Performance Scores

In [50]:
clf = ['LR_OS', 'LR_US', 'BagLR_OS', 'BagLR_US', 'DT', 'RF']
sen = [lr_os_sen, lr_us_sen, blr_os_sen, blr_us_sen, dt_sen, rf_sen]
spe = [lr_os_spe, lr_us_spe, blr_os_spe, blr_us_spe, dt_spe, rf_spe]
prec = [lr_os_prec, lr_us_prec, blr_os_prec, blr_us_prec, dt_prec, rf_prec]
acc = [lr_os_acc, lr_us_acc, blr_os_acc, blr_us_acc, dt_acc, rf_acc]

perf_dict = {'classifier': clf, 'sensitivity': sen, 'specificity':spe, 'precision': prec, 'accuracy':acc}
In [51]:
draw_df = pd.DataFrame(perf_dict)
draw_melt_df = pd.melt(draw_df, id_vars=['classifier'],)
In [52]:
fig = plt.figure(figsize=(30,30))
perf_plt = sns.catplot(x='classifier', y='value',col='variable', col_wrap=2,
                       kind='bar', data=draw_melt_df, palette=sns.color_palette('Paired'))
perf_plt.axes[0].set_title('Sensitivity', fontsize='20')
perf_plt.axes[1].set_title('Specificity', fontsize='20')
perf_plt.axes[2].set_title('Precision',fontsize='20')
perf_plt.axes[3].set_title('Accuracy',fontsize='20')
perf_plt.axes[2].set_xticklabels(labels= clf, rotation='vertical',fontdict={'size':20})
perf_plt.axes[3].set_xticklabels(labels= clf, rotation='vertical',fontdict={'size':20})
perf_plt.axes[2].set_xlabel('')
perf_plt.axes[3].set_xlabel('')
perf_plt.axes[0].set_ylabel('Score', fontsize='20')
perf_plt.axes[2].set_ylabel('Score', fontsize='20')
perf_plt.axes[0].set_yticklabels(labels=['0','0.1','0.2','0.3','0.4','0.5','0.6','0.7'], fontdict={'size':20})
perf_plt.axes[2].set_yticklabels(labels=['0','0.1','0.2','0.3','0.4','0.5','0.6','0.7'], fontdict={'size':20})
for index, row in draw_df.iterrows():
    perf_plt.axes[0].text(index-.35, row.sensitivity+.015, round(row.sensitivity,2), fontdict={'size':14})
    perf_plt.axes[1].text(index-.35, row.specificity+.015, round(row.specificity,2), fontdict={'size':14})
    perf_plt.axes[2].text(index-.35, row.precision+.015, round(row.precision,2), fontdict={'size':14})
    perf_plt.axes[3].text(index-.35, row.accuracy+.015, round(row.accuracy,2), fontdict={'size':14})
<Figure size 9000x9000 with 0 Axes>

6 Conclusion

Compared to other readmission prediction work, our classifiers perform fairly. One of the challenges in this project is that readmission data is inherently imbalanced. This gives rise to issues regarding designing the model as well as evaluating the model. Class weights can be modified in certain classifiers like random forest and neural network. When it comes to logistic regression, we employed statistic techniques like down sampling and over sampling. Accuracy is obviously inappropriate here. Therefore we also digged into model's sensitivity, specificity, precision and confusion matrix. Ensemble methods such as random forest and bagging logistic regression turn out to provide better performance than a single classifier.